CMIR: A Corpus for Evaluation of Code Mixed Information Retrieval of Hindi-English Tweets

Authors

  • Kunal Chakma
  • Amitava Das
Abstract

Social media has become almost ubiquitous. This proliferation creates a need for automatic information processing, which poses various challenges. Social media content is mostly informal. Moreover, on Indian social media, users often write their native languages in Roman transliteration, with English embedded. Information retrieval (IR) on such data is therefore challenging when the documents and the queries mix two or more languages written in native scripts, in Roman transliterated form, or both. In this paper we highlight issues in IR for code-mixed Indian social media texts, particularly texts from Twitter. We describe a corpus collection process, report the limitations of available state-of-the-art IR systems on such data, and formalize the problem of Code-Mixed Information Retrieval on informal texts.

Similar Resources

CEN@Amrita FIRE 2016: Context based Character Embeddings for Entity Extraction in Code-Mixed Text

This paper presents the working methodology and results for the Code Mix Entity Extraction in Indian Languages (CMEE-IL) shared task of FIRE-2016. The aim of the task is to identify various entities, such as person, organization, movie and location names, in given code-mixed tweets. The code-mixed tweets are written in English mixed with Hindi or Tamil. In this work, Entity Extraction system ...

Exploiting Named Entity Mentions Towards Code Mixed IR : Working Notes for the UB system submission for MSIR@FIRE-2016

A sizable percentage of online user-generated content is susceptible to code switching and code mixing for a variety of reasons. An expected consequence is that ad hoc user queries on such data are also inherently code mixed. This paper presents our solution for one such scenario: information retrieval on code-mixed Hindi-English tweets. We explore techniques in information ext...

A Hindi-English Code-Switching Corpus

The aim of this paper is to investigate the rules and constraints of code-switching (CS) in Hindi-English mixed language data. In this paper, we’ll discuss how we collected the mixed language corpus. This corpus is primarily made up of student interview speech. The speech was manually transcribed and verified by bilingual speakers of Hindi and English. The code-switching cases in the corpus are...

Aggression-annotated Corpus of Hindi-English Code-mixed Data

As interaction over the web has increased, incidents of aggression and related events like trolling, cyberbullying, flaming, hate speech, etc. have also increased manifold across the globe. While most of these behaviours, like bullying or hate speech, predate the Internet, the reach and extent of the Internet has given them an unprecedented power and influence to affect the lives of bill...

Joining Hands: Exploiting Monolingual Treebanks for Parsing of Code-mixing Data

In this paper, we propose efficient and less resource-intensive strategies for parsing of code-mixed data. These strategies are not constrained by in-domain annotations, rather they leverage pre-existing monolingual annotated resources for training. We show that these methods can produce significantly better results as compared to an informed baseline. Besides, we also present a data set of 450...

Journal title:
  • Computación y Sistemas

Volume 20

Publication date 2016